feat(e2e_eval): add --build-only mode with per-EP matrix, export dedup, and Azure Artifacts upload by KayMKM · Pull Request #845 · microsoft/winml-cli

KayMKM · 2026-06-09T07:57:06Z

Summary

Adds a --build-only mode to scripts/e2e_eval/run_eval.py that generates ModelKit pipeline artifacts (export → optimize → quantize, no compile) across the full EP matrix, without requiring EP hardware, and optionally streams them to the Modelkit Azure Artifacts feed while bounding local disk usage.

Motivation: we need a way to mass-produce per-EP model artifacts for ~200 models on a single (CPU) box and distribute them, but (a) the build normally needs the target EP installed, and (b) writing every stage for every (model × EP) fills the disk fast.

What's included

1. `--build-only` mode + per-EP matrix

Runs winml config + winml build --no-compile per model; perf/accuracy are skipped.
When --ep/--device are omitted, builds the eval EP matrix into <model_dir>/<ep>_<device>/ subdirs: qnn_npu, qnn_gpu, ov_cpu, ov_npu, ov_gpu, mlas_cpu, dml_gpu, vitisai_npu. Pinning --ep/--device writes a single build directly into <model_dir>.
Precision per combo reuses the existing eval policy (NPU → w8a16, CPU/GPU → auto, native-quant EPs → --no-quant).
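For reviewers who want the matrix and precision policy in one place, here is a minimal sketch. The (ep, device) pairs come from the subdir names listed above. `sketch_precision`, `combo_dir`, and the `native_quant_eps` parameter are hypothetical names, not the script's own, and checking native-quant EPs before the NPU rule is an assumption.

```python
from pathlib import Path

# (ep, device) pairs, taken from the subdir names in the description.
BUILD_ONLY_MATRIX = [
    ("qnn", "npu"), ("qnn", "gpu"),
    ("ov", "cpu"), ("ov", "npu"), ("ov", "gpu"),
    ("mlas", "cpu"), ("dml", "gpu"), ("vitisai", "npu"),
]


def combo_dir(model_dir: Path, ep: str, device: str) -> Path:
    """Per-combo output dir: <model_dir>/<ep>_<device>/."""
    return model_dir / f"{ep}_{device}"


def sketch_precision(ep: str, device: str, native_quant_eps=frozenset()) -> str:
    """Return a policy label; which EPs count as native-quant is left to the caller."""
    if ep in native_quant_eps:   # assumption: this check takes precedence
        return "no-quant"
    if device == "npu":
        return "w8a16"
    return "auto"                # CPU and GPU


for ep, device in BUILD_ONLY_MATRIX:
    print(combo_dir(Path("model_dir"), ep, device), sketch_precision(ep, device))
```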

2. Cross-EP / cross-host builds (core fix in `build.py`)

_run_optimize_stage called resolve_device(device, ep=ep) purely to pick a progress-bar key, which raised when the target EP wasn't installed locally — blocking offline generation of (e.g.) QNN/VitisAI artifacts on a CPU box.
Now: when the build won't compile (config.compile is None), the missing-EP lookup soft-fails and falls back to the requested device (optimize/quantize only need the EP's static rule data, not a registered EP). When compile will run, it still fails fast. No behavior change for normal compiling builds.
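A rough sketch of the gating, assuming the lookup raises `ValueError` and that an unavailable EP can be told apart by its message (the real check in `build.py` may differ). `progress_device` and `fake_resolve` are hypothetical stand-ins; only the error text is taken from this PR.

```python
def fake_resolve(device: str, ep: str) -> str:
    """Stand-in for the real lookup: 'qnn' is treated as not installed."""
    if device not in {"cpu", "gpu", "npu"}:
        raise ValueError(f"Unknown device {device!r}")
    if ep == "qnn":
        raise ValueError("Requested EP 'qnn' is not available on this system")
    return device


def progress_device(device: str, ep: str, will_compile: bool, resolve=fake_resolve) -> str:
    """Pick the device used for the progress-bar key."""
    try:
        return resolve(device, ep)
    except ValueError as exc:
        # Assumption: the message identifies a missing EP.
        ep_missing = "is not available on this system" in str(exc)
        if will_compile or not ep_missing:
            raise          # compiling builds and malformed names still fail fast
        return device      # no-compile build: fall back to the requested device


print(progress_device("npu", "qnn", will_compile=False))
```

With `will_compile=True` the same call re-raises, which matches the "no behavior change for compiling builds" claim.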

3. Export dedup (disk saver)

The export.onnx stage is EP/device-independent, so all 8 combos produce an identical export. After each combo builds, its export is hash-compared against a per-model canonical: the first is moved to <model_dir>/_shared/, later identical ones are deleted — one export copy instead of 8 (export is the largest, full-precision artifact).
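The dedup step can be sketched roughly as below, assuming a single export file per combo (the real export may span several files). `file_sha256` and `dedup_export` are illustrative names. Keeping the file in place on an `OSError` follows the review fix recorded later in this PR.

```python
import hashlib
import shutil
from pathlib import Path


def file_sha256(path: Path, chunk: int = 1 << 20) -> str:
    """Hash a file in chunks so large exports are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()


def dedup_export(export: Path, shared_dir: Path) -> str:
    """Return 'moved', 'deleted', or 'kept' for one combo's export."""
    canonical = shared_dir / export.name
    try:
        if not canonical.exists():
            shared_dir.mkdir(parents=True, exist_ok=True)
            shutil.move(str(export), str(canonical))   # first export becomes canonical
            return "moved"
        if file_sha256(export) == file_sha256(canonical):
            export.unlink()                            # verified identical: drop the copy
            return "deleted"
    except OSError:
        pass    # unreadable file: never delete something not verified identical
    return "kept"
```

A differing export is left where it is, so a mismatch costs disk space rather than data.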

4. `--upload`: stream to Azure Artifacts feed, then delete locally

After a model's combos are built, publishes the whole model dir to the Modelkit feed as a Universal Package, then deletes the local copy — peak disk stays at ~one model's matrix.
Auth via az login (Entra ID), no PAT. The azure-devops extension is ensured and login verified up front; if not ready, the run aborts (so disk isn't silently filled).
Package: single name winml-cli-models, one version per model: 0.0.0-<run-stamp>-<model-slug> (valid SemVer 2.0; the 0.0.0- core keeps it legal, the date stamp + slug are the pre-release segment). The shared run-stamp groups a batch.
Resume: --continue + the same --run-stamp skips already-uploaded models without rebuilding them. Already-uploaded models are detected two ways: a local build_only_uploads.json manifest, and a query against the feed itself at startup (versions matching 0.0.0-<run-stamp>-*). Because the manifest is only written after a successful upload, a fresh --output-dir would otherwise start empty and rebuild everything; seeding from the feed makes it authoritative for what's published, so resume works regardless of local state. A feed-query failure falls back to local-manifest-only behavior.
Extra flags: --run-stamp, --keep-local, --upload-skip-existing, --feed/--feed-org/--feed-project/--package-name.
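To make the version scheme concrete, here is a small sketch of how such a string could be assembled. The slug rule (lowercase, runs of other characters collapsed to a hyphen) is an assumption inferred from the example versions in this PR. `model_slug` and `package_version` are hypothetical helpers.

```python
import re

# 0.0.0 core plus a single pre-release identifier of [0-9A-Za-z-] characters.
VERSION_RE = re.compile(r"^0\.0\.0-[0-9A-Za-z-]+$")


def model_slug(model_id: str, task: str = "") -> str:
    """Lowercase and collapse anything outside [0-9a-z] into single hyphens."""
    raw = f"{model_id}-{task}" if task else model_id
    return re.sub(r"[^0-9a-z]+", "-", raw.lower()).strip("-")


def package_version(run_stamp: str, model_id: str, task: str = "") -> str:
    """Build 0.0.0-<run-stamp>-<model-slug> and check its shape."""
    version = f"0.0.0-{run_stamp}-{model_slug(model_id, task)}"
    if not VERSION_RE.match(version):
        raise ValueError(f"unexpected version shape: {version!r}")
    return version


print(package_version("20260609", "microsoft/resnet-50", "image-classification"))
# 0.0.0-20260609-microsoft-resnet-50-image-classification
```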

Usage

# Build the full EP matrix for P0 models, stream to the feed, delete locals
uv run python scripts/e2e_eval/run_eval.py --build-only --upload --priority P0

# Resume an interrupted batch (same run-stamp; a fresh output dir is fine)
uv run python scripts/e2e_eval/run_eval.py --build-only --upload --continue \
  --run-stamp 20260609 --priority P0

# Local only (no feed), or pin a single EP/device
uv run python scripts/e2e_eval/run_eval.py --build-only --hf-model microsoft/resnet-50

Download a specific model's specific file later:

az artifacts universal download \
  --organization https://dev.azure.com/microsoft --project windows.ai.toolkit \
  --scope project --feed Modelkit --name winml-cli-models \
  --version 0.0.0-20260609-microsoft-resnet-50-image-classification \
  --path ./out --file-filter 'qnn_npu/model.onnx*'

Notes / scope

src/winml/modelkit/commands/build.py change is gated on config.compile is None, so it only affects no-compile builds; compiling builds are unchanged and still fail fast on a missing EP.
--continue resumes from the feed (versions matching 0.0.0-<run-stamp>-*) and the local build_only_uploads.json manifest, so a resumed run only needs the same --run-stamp — it no longer has to reuse the same --output-dir. The feed query uses two &-free REST GETs (list packages → resolve the UPack package GUID → list versions) because az resolves to az.cmd and cmd.exe splits query strings on &, dropping every parameter after the first.
Verified manually end-to-end on a CPU host: full 8-EP matrix build, export dedup, publish to Modelkit, and --file-filter download of individual model.onnx files.
The feed: https://dev.azure.com/microsoft/windows.ai.toolkit/_artifacts/feed/Modelkit/UPack/winml-cli-models/overview/0.0.0-20260609-prajjwal1-bert-tiny-text-classification

Add a --build-only mode to run_eval.py that runs config + build with --no-compile, writing each pipeline stage's ONNX (export/optimize/quantize) without requiring execution-provider hardware. Perf and accuracy are skipped. When --ep/--device are omitted, every model is built once per EP in the build-only matrix (qnn npu/gpu, openvino cpu/npu/gpu, mlas, dml, vitisai) into <model_dir>/<ep>_<device>/ subdirs. When either is pinned, a single build is written directly into the model dir. Precision per combo reuses the existing _resolve_precision policy (NPU w8a16, CPU/GPU auto, native-quant EPs unquantized). Reuses the existing _run_build via a build_only flag (-o <dir> --no-compile instead of --use-cache).

Two bugs surfaced when running `run_eval.py --build-only` against the EP matrix on a CPU-only host:

1. Every combo for the 'no native EP' subset (mlas/dml/openvino) was reported as `[FAIL @ complete]` even though export/optimize/quantize/model.onnx all landed correctly. `_run_build` was funnelling build-only results through `_extract_onnx_path`, which scans stdout for a `Final artifact:` marker that `winml build --no-compile` never prints, and falls back to the global cache which build-only doesn't populate (`-o <dir>` writes elsewhere). In build-only mode there is no downstream consumer of the path, so trust exit-code 0 directly and record `build_out` to keep the per-component bookkeeping balanced.

2. QNN/VitisAI combos failed at the optimize stage with `Requested EP 'qnn' is not available on this system`. `_run_optimize_stage` calls `resolve_device(device, ep=ep)` purely to pick the right `has_rule_data_for_ep` key for the progress bar, but that helper raises when the EP isn't installed locally -- even when the rest of the pipeline (export + optimize + quantize) runs on CPU and the EP is only needed at compile time. Soft-fail the lookup *only when* `config.compile is None` (i.e. `--no-compile` or a config that explicitly opts out); otherwise re-raise so configs that will compile still fail fast here instead of deep inside the compile stage.

Also moves `--clean-cache` from per-combo to per-model in `_run_build_only`: combos for the same model share the same HF download, so clearing between combos forced N redundant re-downloads of the same weights.

…facts feed

Running --build-only over the 8-EP matrix for many models fills local disk. Two additions keep disk bounded:

1. Export dedup: the export.onnx stage is EP/device-independent, so every combo produces an identical export. After each combo builds, its export is hash-compared against a per-model canonical: the first is moved to <model_dir>/_shared/, later identical ones are deleted. One export copy on disk instead of 8.

2. --upload: after a model's combos are built, publish the model dir to the Modelkit Azure Artifacts feed as a Universal Package version, then delete it locally. Auth via az login (no PAT); the azure-devops extension is ensured and login verified up front (aborts otherwise so disk isn't silently filled). Version is 0.0.0-<run-stamp>-<model-slug> (valid SemVer 2.0; date stamp groups a batch). --continue + --run-stamp resume an interrupted batch from the build_only_uploads.json manifest without rebuilding uploaded models; --keep-local, --upload-skip-existing, and feed/package args round it out.

DingmaomaoBJTU

Reviewed the full diff. Overall structure is solid and docstrings are thorough. Found a potential data-loss path (false-positive conflict detection triggers local directory deletion), a silent no-op for --continue in the non-upload build-only path, and a few other correctness/usability issues — see inline comments.

… versions

The --continue skip logic only consulted the local build_only_uploads.json manifest, which is written after each successful upload. A fresh --output-dir (e.g. a gitignored temp dir) starts empty, so models already published to the Azure Artifacts feed under the same run-stamp were rebuilt and re-uploaded instead of being skipped.

Seed the in-memory manifest from the feed at startup: query the feed REST API for versions matching 0.0.0-<run-stamp>-* and mark them as uploaded so the existing skip check honors them. The feed is now authoritative for what's published, regardless of local state. Querying is best-effort -- a failure falls back to local-manifest-only behavior.

Use two ampersand-free REST GETs (list packages -> resolve UPack package GUID -> list versions) because az resolves to az.cmd and cmd.exe splits query strings on '&', dropping every parameter after the first.
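The seeding logic described in this commit might look roughly like the following. It assumes the manifest is a JSON list of version strings, which may not match the real file layout. `load_uploaded` is a hypothetical name, and `feed_versions=None` stands for a failed feed query.

```python
import json
from pathlib import Path


def load_uploaded(manifest_path: Path, feed_versions, run_stamp: str) -> set:
    """Union of the local manifest and feed versions for this run-stamp."""
    uploaded = set()
    if manifest_path.exists():
        # Assumption: the manifest is a JSON list of version strings.
        uploaded.update(json.loads(manifest_path.read_text()))
    if feed_versions is not None:            # None means the feed query failed
        prefix = f"0.0.0-{run_stamp}-"
        uploaded.update(v for v in feed_versions if v.startswith(prefix))
    return uploaded
```

With a fresh output dir the manifest is missing, so the result is whatever the feed reports for the run-stamp. With a failed query it is the manifest alone.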

- _hash_files: stop hashing unreadable files to a fixed sentinel; propagate OSError and have _dedup_export keep the export in place instead of risking deletion of an artifact never verified identical.
- _is_publish_conflict: narrow detection to specific version-exists / HTTP 409 markers (drop bare 'conflict'/'409') so an unrelated message can't trigger exists-skipped and rmtree the local model dir.
- build.py _run_optimize_stage: narrow the no-compile EP fallback to only swallow EP-not-available ValueErrors; re-raise malformed device/EP names.
- Warn when --continue is used with --build-only but without --upload (no local-disk resume exists, so everything is rebuilt).
- Document that the pinned-EP auto-device path delegates precision to winml config's auto-detection.
- Fix misleading --upload-skip-existing help: it does not skip the build.

When a mid-run upload failed because Azure CLI was unavailable (not logged in, token expired, or az hung and was killed), the model was marked 'failed' and kept locally while the batch continued. Every subsequent model hit the same az failure and its local copy piled up, filling the disk (a single 7B LLM matrix is ~450 GB). Add _is_az_unavailable() to distinguish a host-level az/login problem from a per-package publish error (network blip or version conflict), and abort the run immediately (exit 3) when an upload fails for that reason. The user re-runs 'az login' and resumes with --continue + the same --run-stamp; already-uploaded models are skipped.
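A sketch of the abort-versus-continue split, under stated assumptions. The marker strings are illustrative: `invalid_grant` is the marker a later commit in this PR settles on, and the `az login` hint is a placeholder. `is_az_unavailable` and `run_batch` are hypothetical stand-ins for the script's own helpers.

```python
# Illustrative markers only; the real list lives in run_eval.py.
AZ_UNAVAILABLE_MARKERS = ("please run 'az login'", "invalid_grant")


def is_az_unavailable(stderr: str) -> bool:
    """True when the failure looks host-level (login/token), not per-package."""
    text = stderr.lower()
    return any(marker in text for marker in AZ_UNAVAILABLE_MARKERS)


def run_batch(models, upload):
    """upload(model) -> (ok, stderr). Returns (exit_code, uploaded, failed)."""
    uploaded, failed = [], []
    for model in models:
        ok, stderr = upload(model)
        if ok:
            uploaded.append(model)
        elif is_az_unavailable(stderr):
            return 3, uploaded, failed   # abort now; later models would fail the same way
        else:
            failed.append(model)         # per-package error: record it and continue
    return 0, uploaded, failed
```

Stopping at the first host-level failure is what keeps later models from being built and left on disk.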

Uploading all 8 EP/device combos as one Universal Package version timed out on large models and left the local artifacts behind, filling the disk. Each combo is now built, uploaded, and deleted on its own.

- Version is per combo: 0.0.0-<run-stamp>-<ep>-<device>-<model-slug>, so each package is small (lower timeout risk) and can be retried/resumed on its own.
- Local artifacts are removed after every outcome (uploaded, version-exists, timeout, upload-failed, build-failed) unless --keep-local, so disk stays bounded. A timeout/failure is recorded and the run continues; only a host-level az failure (not logged in / token expired) aborts.
- Every (model, combo) outcome is written to build_only_results.json (replaces build_only_uploads.json) for auditing, with or without --upload; it also drives per-combo --continue resume.
- Export dedup now applies only without --upload (each uploaded combo is self-contained).

Adds _classify_upload helper and unit tests for the version format, results I/O, az-unavailable classification, and the per-combo cleanup/timeout/abort orchestration.
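The per-combo flow can be sketched as follows, assuming build and upload are passed in as callables. `combo_version` and `process_combo` are illustrative names. The outcome strings are the ones listed in this commit message.

```python
import shutil
from pathlib import Path


def combo_version(run_stamp: str, ep: str, device: str, slug: str) -> str:
    """Per-combo version: 0.0.0-<run-stamp>-<ep>-<device>-<model-slug>."""
    return f"0.0.0-{run_stamp}-{ep}-{device}-{slug}"


def process_combo(combo_dir: Path, build, upload, keep_local: bool = False) -> str:
    """Build, upload, and clean up one combo; returns the outcome string.

    build(path) -> bool; upload(path) -> 'uploaded' | 'version-exists' |
    'timeout' | 'upload-failed'. Both callables are stand-ins.
    """
    try:
        if not build(combo_dir):
            return "build-failed"
        return upload(combo_dir)
    finally:
        # Runs after every outcome, so disk stays bounded unless --keep-local.
        if not keep_local:
            shutil.rmtree(combo_dir, ignore_errors=True)
```

Putting the cleanup in `finally` means a timeout or an exception in the upload step also frees the disk.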

…atrix # Conflicts: # src/winml/modelkit/commands/build.py

DingmaomaoBJTU

v2 re-review: the previous five issues are all addressed -- OSError propagation in _hash_files, narrow conflict markers, --continue warning, per-combo upload design, and the build.py ep_unavailable guard. Remaining findings: one correctness bug (pagination truncation in _fetch_feed_versions silently breaks --continue resume on large feeds) and two lower-severity concerns.

- _fetch_feed_versions: return None (not an empty set) when the package is absent from the /packages listing. That listing is paginated (~25/page default) and $top can't be appended here -- a second query param needs `&`, which az.cmd/cmd.exe splits -- so an empty set was indistinguishable from a pagination miss and silently rebuilt the whole batch on --continue. None takes the explicit "could not query feed" fallback instead.
- _is_az_unavailable: narrow the broad 'refresh token' marker to 'invalid_grant' (the OAuth2 code Azure AD emits for an expired/revoked refresh token), so an informational MSAL "Refreshing token..." line on a transient exit-1 can no longer abort the entire run.
- Document that --upload-skip-existing is the safe flag on a --continue resume: a timed-out upload may have committed server-side, so the retry's 409 should count as done, not failed (README example + flags table, the flag's help text, and the _classify_upload docstring).

Adds unit tests for the narrowed marker and for _fetch_feed_versions not-found -> None / matching-versions / query-failure.
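The None-versus-empty-set distinction from the first bullet, as a sketch. The input shapes (a list of `{"id", "name"}` dicts and a mapping from package id to version strings) are simplified assumptions, not the feed's actual response schema. `feed_versions_for_run` is a hypothetical name.

```python
def feed_versions_for_run(packages, versions_by_id, package_name, run_stamp):
    """Return matching versions, or None when the package can't be found.

    None means "could not tell" (the package may sit on a later page), which
    is different from an empty set ("package found, nothing published yet").
    """
    match = next((p for p in packages if p.get("name") == package_name), None)
    if match is None:
        return None
    prefix = f"0.0.0-{run_stamp}-"
    return {v for v in versions_by_id.get(match["id"], []) if v.startswith(prefix)}
```

A caller would treat `None` like a failed query and fall back to the local results file, rather than concluding that nothing was uploaded.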

CodeQL flagged the best-effort 'remove now-empty model dir' cleanup in _run_build_only as an empty 'except OSError: pass'. Add an explanatory comment (no behaviour change) so the intent -- ignoring a non-empty/locked dir -- is documented.

KayMKM added 3 commits June 5, 2026 16:44

KayMKM requested a review from a team as a code owner June 9, 2026 07:57

KayMKM changed the title ~~Yuesu/build only ep matrix~~ feat(e2e_eval): add --build-only mode with per-EP matrix, export dedup, and Azure Artifacts upload Jun 9, 2026

Merge branch 'main' into yuesu/build-only-ep-matrix

54a2449

DingmaomaoBJTU reviewed Jun 9, 2026


KayMKM added 6 commits June 10, 2026 12:03

Merge branch 'main' into yuesu/build-only-ep-matrix

d2ac073

Merge remote-tracking branch 'origin/main' into yuesu/build-only-ep-m…

22dbeb3

…atrix # Conflicts: # src/winml/modelkit/commands/build.py

DingmaomaoBJTU reviewed Jun 26, 2026


Comment thread scripts/e2e_eval/run_eval.py

Comment thread scripts/e2e_eval/run_eval.py Outdated

Comment thread scripts/e2e_eval/run_eval.py

Comment thread src/winml/modelkit/commands/build.py Outdated

github-advanced-security AI found potential problems Jun 26, 2026


Comment thread scripts/e2e_eval/run_eval.py Fixed

KayMKM added 3 commits June 26, 2026 15:25

Merge branch 'main' into yuesu/build-only-ep-matrix

d53d067

KayMKM wants to merge 13 commits into main from yuesu/build-only-ep-matrix


3 participants
